Abstract

This paper explores how a locality constraint can enhance both learning and coding schemes, specifically discriminative low-rank dictionary learning and the auto-encoder. Previous dictionary learning based on Fisher discrimination has produced interesting results by learning more discriminative sub-dictionaries. In addition, a low-rank regularization term has been introduced to take advantage of the global structure of the data. However, such methods fail to consider the intrinsic manifold structure of the data. To this end, we first apply a locality constraint to dictionary learning to examine whether geometric structure information improves the discriminative capability. Moreover, inspired by recent advances in auto-encoders for learning compact feature spaces, we propose a locality-constrained collaborative auto-encoder (LCAE) for feature extraction. The improvement obtained by applying locality to dictionary learning and to the auto-encoder is evaluated on several datasets. Experimental results demonstrate the effectiveness of locality information in comparison with state-of-the-art methods.


[Figure 1 graphic: an input and a target; the target is reconstructed using the locality constraint, $\|x_i - D c_i\|^2 + \lambda \|l_i \odot c_i\|^2$.]


Figure 1. Illustration of our methods. The locality constraint is adopted in both the dictionary learning (DL) and auto-encoder (AE) schemes. For DL, the negative effect of noise in the training samples is reduced by learning low-rank sub-dictionaries. For AE, the target is reconstructed to be consistent with the locality-constrained criterion.


1.  Introduction 

Recent research has led to rapid growth in the theory and application of sparse representation and has demonstrated promising results in face recognition, image classification, and related tasks. The key idea is to represent each input signal as a linear combination of atoms from a given dictionary D. The quality of the dictionary is therefore a critical factor for sparse representation.

A problem with directly using the original training samples as the dictionary [26] is that the test samples cannot be faithfully represented, owing to the noise and ambiguity in the dictionary. In addition, this strategy ignores the discriminative information hidden in the training samples. Both problems can be addressed by learning a proper dictionary from the original training samples. The goal of dictionary learning is to learn from the training data a set of basis vectors with which a given signal can be well represented. The recognition rate of image classification has improved significantly with well-adapted dictionaries, and considerable research effort has been devoted to learning dictionaries that represent test samples distinctively. Based on K-SVD [1], a discriminative constraint that accounts for classification error was added to the dictionary learning model in order to gain discriminability [31]. Jiang et al. enforced discriminative ability by associating label information with each dictionary atom [8]. For learning a structured dictionary, the Fisher criterion was introduced to learn sub-dictionaries corresponding to the different class labels [27].

The algorithms above, however, work well only when the signals are clean or corrupted by small noise. If the training samples are corrupted by large noise, the dictionary atoms themselves become corrupted in the course of representing the training samples.

Recently, low-rank representation [15] has been successfully applied to unsupervised subspace segmentation [14], object detection [23], and 3D visual recovery [29]. It recovers a low-rank matrix from corrupted input data. If the columns of a given matrix Y share the same pattern and Y is corrupted by a sparse noise matrix E, rank minimization can recover Y in practice while removing the sparse noise E. When dictionary learning is applied to face recognition, the within-class samples are drawn from a low-dimensional subspace and are linearly correlated. Each sub-dictionary that represents within-class samples should therefore reasonably be low-rank. Building on this observation, low-rank regularization was integrated into sparse representation so that sparse noise is separated from the inputs while the dictionary atoms are simultaneously optimized to reconstruct the de-noised signals. The resulting DLRD algorithm [16] achieves impressive results, especially when corruption exists.

Previous approaches based on sparse representation assume that each sample has an independent sparse linear combination, which ignores the spatial consistency of neighboring points and fails to exploit the relationship between similar samples. Recent studies have reported more promising results on classification tasks by using the idea of locality [25]. Local Coordinate Coding (LCC), a modification of sparse coding, showed theoretically that locality is more essential than sparsity under certain assumptions, and it explicitly encourages the coding to rely on local structure. Since then, several locality-constrained coding methods have been proposed to replace the sparsity constraint in scene categorization [21], human action recognition [6], and image colorization [13].

Motivated by the above techniques, this paper explores how classification can be enhanced by adding a locality constraint to both learning and coding schemes, specifically discriminative low-rank dictionary learning and the auto-encoder. First, we introduce an algorithm with low-rank regularization on the discriminative sub-dictionaries and a locality constraint on the coefficients. Second, in contrast to previous work on locality-constrained linear coding [22, 21], we study locality in the more complicated auto-encoder setting to further examine its effect. A locality-constrained collaborative auto-encoder (LCAE) is proposed to extract features with local information and thereby enhance classification ability. The main contributions of this paper are: 1) we investigate the impact of the locality constraint on dictionary learning and improve the results on several benchmark datasets; 2) we propose a locality-constrained collaborative auto-encoder (LCAE) that provides features with intrinsic local information.

The rest of this paper is structured as follows. Section 2 presents the proposed locality-constrained dictionary learning method and its optimization. Section 3 introduces the locality-constrained collaborative auto-encoder (LCAE) model. Section 4 reports the experiments and analysis, and Section 5 draws conclusions.

2. Locality-constrained Discriminative Low-Rank DL (LC-LRD)

We first briefly review a discriminative dictionary learning algorithm with low-rank regularization [16], which improves performance even when the training samples contain large noise. We then add a locality constraint in place of sparse coding to exploit the manifold structure of local features more thoroughly.

2.1.  Discriminative  Low-Rank  DL 

We are given a training dataset $Y = [Y_1, Y_2, \ldots, Y_c] \in \mathbb{R}^{d \times N}$, where $c$ is the number of classes, $d$ is the feature dimension, $N$ is the total number of training samples, and $Y_i \in \mathbb{R}^{d \times n_i}$ contains the $n_i$ samples from class $i$. From $Y$, we want to learn a discriminative dictionary $D$ and the coding coefficient matrix $X$, which is used for the subsequent classification task. We model $Y = DX + E$, where $E$ is the noise. Rather than using all training samples to learn a single dictionary, each sub-dictionary $D_i$ for the $i$-th class is learned separately. $X$ and $D$ can then be written as $X = [X_1, X_2, \ldots, X_c]$ and $D = [D_1, D_2, \ldots, D_c]$, where $D_i$ is the sub-dictionary for class $i$ and $X_i$ is the coefficient sub-matrix that represents $Y_i$ over $D$.

Sub-dictionary $D_i$ should be able to represent samples from the $i$-th class well. The coding coefficients of $Y_i$ over $D$ can be written as $X_i = [X_i^1; X_i^2; \ldots; X_i^c]$, where $X_i^j$ is the coefficient matrix of $Y_i$ over $D_j$. The discriminative power of $D_i$ comes from two aspects. First, $Y_i$ is expected to be well represented by $D_i$ rather than by $D_j$, $j \neq i$, so it is reasonable to minimize $\|Y_i - D_i X_i^i - E_i\|_F^2$. Second, $D_i$ is not supposed to be good at representing samples from other classes; that is, each $X_j^i$ with $j \neq i$ should be nearly zero so that $\|D_i X_j^i\|_F^2$ is as small as possible. We therefore define the discriminative fidelity term for sub-dictionary $D_i$ as

$$R(D_i, X_i) = \|Y_i - D_i X_i^i - E_i\|_F^2 + \sum_{j=1, j \neq i}^{c} \|D_i X_j^i\|_F^2. \tag{1}$$
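As a hedged numerical illustration of Eq. (1), the following NumPy sketch evaluates the fidelity term for one class. The function name `fidelity_term` and the list layout `X_blocks` (entry `j` holds $X_j^i$, the coefficients of class-$j$ samples over $D_i$) are assumptions made for this example only, not part of the original method description.

```python
import numpy as np

def fidelity_term(Y_i, D_i, E_i, X_blocks, i):
    """Discriminative fidelity term R(D_i, X_i) of Eq. (1).

    X_blocks[j] holds X_j^i, the coefficients of the class-j samples over
    the sub-dictionary D_i (hypothetical layout chosen for this sketch).
    """
    # Class i should be reconstructed well by its own sub-dictionary.
    own = np.linalg.norm(Y_i - D_i @ X_blocks[i] - E_i, "fro") ** 2
    # Samples of the other classes should barely use the atoms of D_i.
    cross = sum(
        np.linalg.norm(D_i @ X_j, "fro") ** 2
        for j, X_j in enumerate(X_blocks)
        if j != i
    )
    return own + cross

# Toy example: two atoms in R^3, two classes, no noise.
D_0 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
X_00 = np.array([[0.5, 1.0], [0.5, 0.0]])   # class-0 samples over D_0
X_10 = np.array([[1.0, 0.0], [0.0, 0.0]])   # class-1 samples leaking onto D_0
Y_0 = D_0 @ X_00                            # class 0 is represented exactly
E_0 = np.zeros_like(Y_0)
value = fidelity_term(Y_0, D_0, E_0, [X_00, X_10], i=0)
```

In this toy case the first term vanishes and only the leakage of class 1 onto $D_0$ contributes to the value.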

When dealing with face images, the within-class samples lie on a low-dimensional manifold and are linearly dependent. Sub-dictionaries should therefore be trained to be low-rank in order to represent samples from the same class. To this end, among all possible sub-dictionaries $D_i$ we seek the one with the most concise atoms, that is, we minimize the rank of $D_i$. Recent research suggests that the rank function can be replaced by its convex surrogate [4], $\|D_i\|_*$, where $\|\cdot\|_*$ denotes the nuclear norm, the sum of the singular values of a matrix.
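The relation between rank and nuclear norm can be checked numerically. The sketch below is only an illustration with made-up matrices; the helper name `nuclear_norm` is an assumption for this example.

```python
import numpy as np

def nuclear_norm(M):
    """Sum of the singular values of M, the convex surrogate of rank(M)."""
    return np.linalg.svd(M, compute_uv=False).sum()

rng = np.random.default_rng(0)

# A rank-1 "sub-dictionary": every atom is a multiple of one direction.
direction = rng.standard_normal((50, 1))
low_rank = direction @ rng.standard_normal((1, 8))

# A rank-1 matrix has a single non-zero singular value, so its nuclear
# norm coincides with its Frobenius norm.
nn_low = nuclear_norm(low_rank)
fro_low = np.linalg.norm(low_rank, "fro")

# For a diagonal matrix the singular values are the absolute diagonal entries.
nn_diag = nuclear_norm(np.diag([3.0, -4.0]))
```

Minimizing the nuclear norm therefore pushes singular values toward zero, which is the mechanism by which it encourages low rank.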

2.2.  Locality  constraint 

In this paper, we impose a locality constraint on the coefficient matrix instead of the sparsity constraint. As indicated by LCC [28], locality is more essential than sparsity under certain assumptions, because a locality constraint leads to sparsity but not necessarily vice versa. Specifically, the locality constraint uses the following criterion:

$$\min_{X} \|l_i \odot x_i\|^2, \quad \text{s.t. } \mathbf{1}^\top x_i = 1, \ \forall i, \tag{2}$$

where $l_i \in \mathbb{R}^k$ is the locality adapter and $\odot$ denotes the element-wise product. The adapter assigns each basis vector a weight according to its similarity to the input sample $y_i$. Specifically,

$$l_i = \exp\left(\frac{\mathrm{dist}(y_i, D)}{\sigma}\right), \tag{3}$$

where $\mathrm{dist}(y_i, D) = [\mathrm{dist}(y_i, d_1), \ldots, \mathrm{dist}(y_i, d_k)]^\top$ and $\mathrm{dist}(y_i, d_j)$ is the Euclidean distance between sample $y_i$ and dictionary atom $d_j$. The parameter $\sigma$ controls the bandwidth of the distribution.
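The locality adapter of Eq. (3) can be sketched in a few lines of NumPy. The function name `locality_adapter` and the toy dictionary are assumptions for illustration; atoms are stored as columns of `D`.

```python
import numpy as np

def locality_adapter(y, D, sigma):
    """Eq. (3): l = exp(dist(y, D) / sigma), one weight per dictionary atom.

    y: (d,) sample; D: (d, k) dictionary whose columns are the atoms.
    """
    dist = np.linalg.norm(D - y[:, None], axis=0)  # Euclidean distance to each atom
    return np.exp(dist / sigma)

# Three atoms in R^2 at distances 0, 1 and 5 from the sample.
D = np.array([[0.0, 1.0, 4.0],
              [0.0, 0.0, 3.0]])
y = np.array([0.0, 0.0])
l = locality_adapter(y, D, sigma=1.0)
```

Atoms far from the sample receive exponentially larger weights, so the penalty $\|l \odot x\|^2$ drives their coefficients toward zero and the code concentrates on nearby atoms.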


2.3.  Our  proposed  model 

Combining the low-rank regularization on the discriminative sub-dictionaries with the locality constraint on the coding coefficients, we obtain the following LC-LRD model for each sub-dictionary:

$$\min_{X_i, D_i, E_i} R(D_i, X_i) + \alpha \|D_i\|_* + \beta \|E_i\|_1 + \lambda \sum_{k=1}^{n_i} \|l_{i,k} \odot x_{i,k}\|_2^2, \quad \text{s.t. } Y_i = D X_i + E_i. \tag{4}$$

LC-LRD is based on the following three observations:

1. The discriminative term gives each sub-dictionary its discriminative ability.

2. Each sub-dictionary should be low-rank in order to separate noise from the samples and discover the latent structure.

3. Inspired by [25] and the above discussion, locality is more essential than sparsity; that is, similar samples should have similar representations.

2.4. Optimization of LC-LRD

To solve the proposed objective function, we divide Eq. (4) into two sub-problems. First, each coefficient matrix $X_i$ ($i = 1, 2, \ldots, c$) is updated in turn while all other $X_j$ ($j \neq i$) and the dictionary $D$ are fixed, and the results are assembled into the coefficient matrix $X$. Second, each sub-dictionary $D_i$ is updated while the others are fixed. The locality-constrained coefficients $X_i$, the discriminative low-rank sub-dictionary $D_i$, and the sparse error $E_i$ are obtained by iterating these two steps.

In the first sub-problem, the dictionary $D$ is assumed to be given and the coefficients $X_i$ ($i = 1, 2, \ldots, c$) are updated one after another. The original objective function in Eq. (4) then reduces to the following locality-constrained coding problem:

$$\min_{X_i, E_i} \lambda \sum_{k=1}^{n_i} \|l_{i,k} \odot x_{i,k}\|_2^2 + \beta_1 \|E_i\|_1, \quad \text{s.t. } Y_i = D X_i + E_i, \tag{5}$$

which can be solved by the following ALM method [3]:

$$\min_{X_i, E_i} \lambda \sum_{k=1}^{n_i} \|l_{i,k} \odot x_{i,k}\|_2^2 + \beta_1 \|E_i\|_1 + \langle P, Y_i - D X_i - E_i \rangle + \frac{\mu}{2} \|Y_i - D X_i - E_i\|_F^2$$
$$= \min_{X_i, E_i} \lambda \sum_{k=1}^{n_i} \|l_{i,k} \odot x_{i,k}\|_2^2 + \beta_1 \|E_i\|_1 + \frac{\mu}{2} \|Z - D X_i\|_F^2, \tag{6}$$

where $Z = Y_i - E_i + P/\mu$, $\mu$ is a positive penalty parameter, $P$ is the Lagrange multiplier, and $l_{i,k} = \exp(\mathrm{dist}(z_{i,k}, D)/\sigma)$ with $z_{i,k}$ the $k$-th column of $Z$. The two forms in Eq. (6) agree up to a constant that does not depend on $X_i$ or $E_i$. Unlike traditional locality-constrained linear coding (LLC) [25], we add an error term that can handle large noise in the samples.

Algorithm 2.4 Updating coefficients via ALM
Input: training data $Y_i$, initial dictionary $D$, parameters $\lambda$, $\sigma$, $\beta_1$
Initialize: $Z = E_i = P = 0$, $\rho = 1.1$, iter $= 0$; the penalty $\mu$, its bound $\mu_{\max}$, the tolerance $\epsilon$, and maxiter are preset constants
while not converged and iter < maxiter do
1. Fix the others and update $Z$ by: $Z = Y_i - E_i + P/\mu$
2. Fix the others and update $X_i$ by: $X_i = \mathrm{LLC}(Z, D, \lambda, \sigma)$ ¹
3. Fix the others and update $E_i$ by: $E_i = \arg\min_{E_i} \frac{\beta_1}{\mu} \|E_i\|_1 + \frac{1}{2} \|E_i - (Y_i - D X_i + P/\mu)\|_F^2$
4. Update the multiplier $P$ by: $P = P + \mu (Y_i - D X_i - E_i)$
5. Update $\mu$ by: $\mu = \min(\rho \mu, \mu_{\max})$
6. Check convergence: $\|Y_i - D X_i - E_i\|_\infty < \epsilon$
end while
Output: $X_i$, $E_i$

¹ We set $Z$, $D$, $\lambda$, and $\sigma$ as the input of LLC [25]; the code can be downloaded from http://www.ifp.illinois.edu/~jyang29/LLC.htm.

The details of the coefficient update are given in Algorithm 2.4. The sub-dictionaries are updated with the same method as in [16].
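The coefficient update of Algorithm 2.4 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `llc_code` is a stand-in for the LLC routine that solves the sum-to-one constrained problem in closed form, the error update is the standard soft-thresholding solution of its $\ell_1$ sub-problem, and the default values of `mu`, `mu_max`, `eps`, and `max_iter` as well as the max-absolute-value stopping test are illustrative choices.

```python
import numpy as np

def llc_code(Z, D, lam, sigma):
    """Closed-form locality-constrained codes, one column of Z at a time.

    Minimizes ||z - D c||^2 + lam * ||l * c||^2 subject to sum(c) = 1,
    where l = exp(dist(z, D) / sigma) is the locality adapter.
    """
    k = D.shape[1]
    X = np.zeros((k, Z.shape[1]))
    for n in range(Z.shape[1]):
        B = D - Z[:, [n]]                          # atoms shifted by the sample
        l = np.exp(np.linalg.norm(B, axis=0) / sigma)
        M = B.T @ B + lam * np.diag(l ** 2)        # data term plus locality penalty
        c = np.linalg.solve(M, np.ones(k))
        X[:, n] = c / c.sum()                      # enforce the sum-to-one constraint
    return X

def soft_threshold(A, tau):
    """Proximal operator of tau * ||.||_1 (element-wise shrinkage)."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def update_coefficients(Y, D, lam, sigma, beta1, mu=1e-2, mu_max=1e6,
                        rho=1.1, eps=1e-6, max_iter=500):
    """ALM loop for the coefficient sub-problem (illustrative defaults)."""
    E = np.zeros_like(Y)
    P = np.zeros_like(Y)
    for _ in range(max_iter):
        Z = Y - E + P / mu                         # data corrected by error and multiplier
        X = llc_code(Z, D, lam, sigma)             # locality-constrained coding of Z
        E = soft_threshold(Y - D @ X + P / mu, beta1 / mu)  # sparse error update
        R = Y - D @ X - E                          # constraint residual
        P = P + mu * R                             # multiplier update
        mu = min(rho * mu, mu_max)                 # increase the penalty
        if np.abs(R).max() < eps:                  # stop when the constraint holds
            break
    return X, E

# Toy data: two samples that are affine combinations of three atoms in R^2.
D = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
Y = D @ np.array([[0.5, 0.3], [0.2, 0.6], [0.3, 0.1]])
X, E = update_coefficients(Y, D, lam=1e-3, sigma=1.0, beta1=0.1)
```

Solving a small linear system per sample keeps the coding step cheap; for large dictionaries the approximated LLC that uses only the nearest atoms would be the natural replacement.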

2.5.  Classification  based  on  our  model 

A linear classifier is used for the final classification. After the dictionary has been learned in training, the locality-constrained coefficients $X$ of the training data $Y$ and $X_{test}$ of the test data $Y_{test}$ are computed. The representation $x_i$ of test sample $i$ is the $i$-th column of $X_{test}$. The linear classifier $V$ is obtained from a multivariate ridge regression model [30]:

$$V = \arg\min_V \|L - V X\|_F^2 + \gamma \|V\|_F^2, \tag{7}$$

where $L$ is the class label matrix of $Y$. This yields $V = L X^\top (X X^\top + \gamma I)^{-1}$. When test samples $Y_{test}$ arrive, we first compute $V X_{test}$. The label of sample $i$ is then assigned by the position of the largest value in its label vector, that is, $\mathrm{label}_i = \arg\max_j (V x_i)_j$.
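The closed-form ridge classifier and the arg-max labeling rule can be sketched as follows. The function names and the toy codes are assumptions for illustration; the regularization weight is called `gamma` here.

```python
import numpy as np

def train_ridge_classifier(X, L, gamma):
    """Closed form V = L X^T (X X^T + gamma I)^(-1) of the ridge model.

    X: (k, n) codes of the training samples; L: (c, n) one-hot label matrix.
    """
    k = X.shape[0]
    # Solve (X X^T + gamma I) V^T = X L^T instead of forming the inverse.
    return np.linalg.solve(X @ X.T + gamma * np.eye(k), X @ L.T).T

def predict(V, X_test):
    """Label of each test column is the index of the largest entry of V x_i."""
    return np.argmax(V @ X_test, axis=0)

# Toy example: three training codes in R^3 with labels 0, 1, 0.
X_train = np.eye(3)
L = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
V = train_ridge_classifier(X_train, L, gamma=0.1)
labels = predict(V, X_train)
```

Using `np.linalg.solve` rather than an explicit inverse is a numerical design choice; it gives the same $V$ because $X X^\top + \gamma I$ is symmetric.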

3. Locality-constrained Collaborative Auto-Encoder (LCAE)

Suppose we have an input image $x \in \mathbb{R}^D$ and hidden units $z \in \mathbb{R}^d$, where $D$ is the dimension of the visual descriptor. The feed-forward pass of the auto-encoder contains two important non-linear transformations, "input → hidden units" and "hidden units → output":

$$z_i = \sigma(W_1 x_i + b_1); \quad h(x_i) = \sigma(W_2 z_i + b_2), \tag{8}$$

where $W_1 \in \mathbb{R}^{d \times D}$, $b_1 \in \mathbb{R}^d$, $W_2 \in \mathbb{R}^{D \times d}$, $b_2 \in \mathbb{R}^D$, and $\sigma$ is the sigmoid function $\sigma(x) = (1 + e^{-x})^{-1}$. An auto-encoder is basically a neural network with a single hidden layer in which the input and the target are identical. The output of the auto-encoder is therefore encouraged to be as similar to the target as possible, that is,

$$\min_{W_1, b_1, W_2, b_2} L(x) = \min_{W_1, b_1, W_2, b_2} \frac{1}{2n} \sum_{i=1}^{n} \|x_i - h(x_i)\|_2^2, \tag{9}$$

where $n$ is the number of samples, $x_i$ is the target, and $h(x_i)$ is the reconstructed input. In this way, the neurons in the hidden layer are able to reconstruct the data and can be seen as a good representation of the input.

To introduce locality into the coding procedure, the input data is first reconstructed according to the LLC coding criterion and then used as the target of the auto-encoder. That is, $x_i$ in Eq. (9) is replaced by a locality reconstruction obtained from

$$\min_{c_i} \|x_i - D c_i\|^2 + \lambda \|l_i \odot c_i\|^2, \quad \text{s.t. } \mathbf{1}^\top c_i = 1, \ \forall i, \tag{10}$$

where the dictionary $D$ is initialized by PCA on the input training matrix $X$. The proposed LCAE can be trained with the back-propagation algorithm, which updates $W$ and $b$ by back-propagating the gradient of the reconstruction error between the output layer and the locality-coded target layer. After each iteration of forward and backward propagation, the locality coefficients are updated using the new output $h(x_i)$.

4. Experiments

We evaluate LC-LRD and LCAE on various visual classification applications to demonstrate the effectiveness and generality of the proposed methods. First, LC-LRD is evaluated on four datasets: two face datasets, Extended YaleB [11] and AR [17]; one object categorization dataset, COIL-100 [19]; and one handwritten digit recognition dataset, MNIST [10]. Second, Extended YaleB, AR, CMU PIE [24], and a newly built Virtual MakeUp (VMU) dataset [5] (samples shown in Fig. 3) are used to evaluate our LCAE method. Experimental results and analysis are presented in this section.

4.1. Experiments on LC-LRD


Several state-of-the-art algorithms are compared on each dataset, including LDA [2], linear regression classification (LRC) [18], and several recent DL-based classification methods, namely FDDL [27], DLRD [16], $D^2L^2R^2$ [12], and DPL [7]. For a fair comparison, in each experiment we keep all steps the same as in the baselines except for the learning stage.

Parameter selection. One of the most important parameters in most dictionary learning methods is the number of atoms in each sub-dictionary, denoted by $m_i$. In our experiments, all $m_i$ ($i = 1, 2, \ldots, c$) are set equal. We analyze the effect of $m_i$ on the performance of LC-LRD, $D^2L^2R^2$, DLRD, FDDL, and DPL, taking Extended YaleB as an example (20 training samples per class; the other settings are given in the next subsection). Fig. 2 shows the accuracy of the five methods versus the number of dictionary atoms. The performance of all methods increases with more atoms, and in all cases our LC-LRD method improves on the other methods by nearly 2%. Since the performance of every method follows a similar trend as the number of atoms increases, we set the number of dictionary columns per class equal to the training size in all following experiments, except for the MNIST dataset, where it is set to 30 per class. The influence of the number of neighbors $k$ used for approximated LLC is studied in the experiments on LCAE; in this section $k$ is set to 10, as suggested in [25].

[Figure 2 plot: recognition rate versus number of dictionary atoms (8 to 18) for FDDL [Yang et al., 2011], DLRD [Ma et al., 2012], DDLLRR [Li et al., 2014], DPL [Gu et al., 2014], and ours.]

Figure 2. The recognition rates of five DL-based methods versus the number of dictionary atoms, with 20 training samples per class on the Extended YaleB dataset.


Figure 3. Sample images from (a) Extended YaleB with 10%, 20%, and 30% random pixel corruption; (b) the AR dataset; (c) COIL-100; and (d) the VMU dataset.


There are five parameters in our approach: $\alpha$, $\lambda$, and $\sigma$ in Eq. (4), and $\beta_1$ and $\beta_2$ for the error terms of dictionary updating and coefficient updating. In the experiments, we found that $\beta_1$ and $\beta_2$ have a larger effect on recognition; therefore, $\alpha$ and $\lambda$ are set to 1 in this paper. Unless otherwise specified, $\beta_1$, $\beta_2$, and the parameters of the compared methods are chosen by 5-fold cross-validation. For Extended YaleB, $\beta_1 = 15$, $\beta_2 = 100$; for AR, $\beta_1 = 5$, $\beta_2 = 100$; for COIL-100, $\beta_1 = 3$, $\beta_2 = 150$; for MNIST, $\beta_1 = 2.5$, $\beta_2 = 2.5$.

The two face recognition datasets and their split subsets are downloaded from the CAD website². With these datasets, we test the robustness of our algorithm to illumination changes and pose variations. Furthermore, we evaluate the robustness of LC-LRD to noise by adding pixel corruption.

Extended YaleB Dataset. The Extended YaleB dataset contains 2414 frontal-face images of 38 subjects captured under various lighting conditions. There are between 59 and 64 images per subject, each normalized to 32 × 32 pixels. Because the different illumination conditions make this dataset challenging, we conduct two experiments on it. First, we choose random subsets with $p$ ($= 5, 10, \ldots, 40$) images per subject as the training set and use the rest of the dataset as the testing set; 10 random splits are used for each $p$. Second, a certain percentage of randomly selected pixels in each image is replaced by the value 255 (shown in Fig. 3 (a)). We then randomly take 30 images per subject as training samples and the rest as testing samples, and repeat the experiment ten times. The results of these two experiments are given in Table 1 and Table 2, respectively.
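The pixel-corruption protocol of the second experiment can be sketched as follows. The function name `corrupt_pixels`, the array layout, and the random seed are assumptions for illustration; only the replacement value 255 and the idea of corrupting a fixed percentage of randomly chosen pixels come from the description above.

```python
import numpy as np

def corrupt_pixels(images, fraction, rng, value=255):
    """Replace a fraction of randomly chosen pixels in each image by `value`.

    images: (n, h, w) array; a corrupted copy is returned.
    """
    out = images.copy()
    n_pix = images.shape[1] * images.shape[2]
    n_bad = int(round(fraction * n_pix))
    flat = out.reshape(len(out), -1)       # view on `out`, one row per image
    for row in flat:
        # Draw distinct pixel positions so exactly n_bad pixels are replaced.
        idx = rng.choice(n_pix, size=n_bad, replace=False)
        row[idx] = value
    return out

rng = np.random.default_rng(0)
clean = np.zeros((4, 32, 32), dtype=np.uint8)    # stand-in for 32 x 32 face images
noisy = corrupt_pixels(clean, fraction=0.10, rng=rng)
```

Sampling positions without replacement keeps the corruption level identical across images, which makes the percentages in the robustness experiment directly comparable.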

Table 1 shows the recognition rates for different training sizes. Our method achieves the best accuracy in all settings. The robustness of our method to noise is demonstrated in Table 2: as the percentage of corruption increases, our algorithm consistently performs best. The performance of FDDL, DPL, LRC, and LDA drops rapidly, whereas our method, $D^2L^2R^2$, and DLRD still achieve much better recognition accuracy under different levels of corruption. This demonstrates the effectiveness of low-rank regularization and the error term when noise exists. Compared with $D^2L^2R^2$ and DLRD, our method still performs better owing to the locality constraint, especially when the corruption is small.

AR Dataset. The AR dataset consists of more than 4,000 frontal-face images of 126 subjects, with 26 pictures per subject taken in two separate sessions. For a fair comparison, we follow the experimental setting in [27] and choose a subset of 50 male and 50 female subjects. For each subject, the 7 images from session 1 with illumination and expression changes are used for training, and the 7 images from session 2 under the same conditions are used for testing. We run experiments on three features: the original 60 × 43 images, resized 27 × 20 images, and the feature provided by [9].

Fig. 4 illustrates the recognition rates for the different features. Our method achieves the best results on all features, and the improvement is larger than that on the YaleB dataset. This may be due to the greater variation in the AR dataset, which suggests that locality is better suited to this kind of data.

² http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html

Table 1. Recognition rate (%) of different algorithms on the Extended YaleB dataset with different numbers of training samples per class.

Training images  DPL [7]      $D^2L^2R^2$ [12]  DLRD [16]    FDDL [27]    LRC [18]     LDA [2]      Ours
5                75.17±1.86   75.96±1.20        76.17±1.16   77.75±1.34   60.24±2.02   74.12±1.52   78.62±1.20
10               89.31±0.62   89.60±0.89        89.94±0.89   91.16±0.85   82.98±0.82   86.67±0.90   92.07±0.89
20               95.69±0.90   96.02±0.91        96.03±0.85   96.15±0.66   91.80±0.97   90.64±1.07   97.86±0.91
30               97.80±0.36   97.87±0.42        97.90±0.47   97.86±0.35   94.60±0.60   86.84±0.92   99.23±0.47
40               98.67±0.43   98.09±0.39        98.80±0.37   98.84±0.46   96.10±0.58   95.27±0.79   99.54±0.44

Table 2. Recognition rate (%) of different algorithms on the Extended YaleB dataset with various corruption percentages (%).

Corruption (%)   DPL [7]      $D^2L^2R^2$ [12]  DLRD [16]    FDDL [27]    LRC [18]     LDA [2]      Ours
0                97.80±0.36   97.87±0.42        97.90±0.47   97.86±0.35   94.60±0.60   86.84±0.92   99.23±0.47
5                78.27±1.22   91.90±1.14        91.84±1.07   63.55±0.87   80.49±1.10   29.03±0.82   93.31±0.69
10               64.58±1.09   85.71±1.51        85.82±1.54   44.65±1.22   67.61±1.33   18.53±1.15   86.97±0.86
15               53.77±0.86   80.46±1.64        80.89±1.37   32.76±1.03   56.81±1.24   13.63±0.53   81.71±0.81
20               44.95±1.38   73.59±1.54        73.56±1.63   25.26±0.42   47.23±1.59   11.30±0.46   74.14±1.01
25               35.87±1.01   65.93±1.15        65.88±1.50   18.45±0.82   38.85±1.18   9.23±0.81    66.45±1.06

Figure 4. Average recognition rate (%) of different algorithms on the AR dataset with three different features. Feature 1: raw pixels, 60 × 43; feature 2: raw pixels, 27 × 20; feature 3: the feature provided by [9].


COIL-100 Dataset. In this section, we assess our approach on object categorization using the COIL-100 dataset. The training set is constructed by randomly selecting 10 images per object, and the remaining images form the testing set. We repeat this random selection ten times and report the average results with standard deviations. To evaluate the scalability of our method and the competing methods, we separately use the samples of 20, 40, 60, 80, and 100 objects from this dataset. Fig. 5 shows the average recognition rates with standard deviations of all compared methods. The results show the generality of our algorithm: locality works not only for face recognition but also for object categorization.

Figure 5. Recognition rate (%) with standard deviations of different algorithms on the COIL-100 dataset.

MNIST Dataset. We evaluate our algorithm on the subset of the MNIST handwritten digit dataset downloaded from the CAD website, which includes the first 2000 training images and the first 2000 test images, each digit image being of size 28 × 28. This experimental setting follows [12], and we obtain consistent results. The recognition rates and training/testing times of the different algorithms on MNIST are summarized in Table 3. Our algorithm achieves the highest accuracy among all competitors. Among the four most accurate methods, ours has the shortest training time because the locality-constrained method updates only part of the dictionary atoms each time.

Table 3. Average recognition rate (%) and running time (s) on the MNIST dataset.

Methods           Accuracy  Training time  Testing time
$D^2L^2R^2$ [12]  84.23     233.429        84.863
DLRD [16]         86.15     243.271        99.787
FDDL [27]         84.85     263.137        388.219
LDA [2]           72.30     0.261          0.576
LRC [18]          82.70     365.605        -
DPL [7]           83.60     4.328          0.125
Ours              88.75     140.793        88.180

4.2. Experiments on LCAE

We report experimental results on four datasets: three widely used face recognition datasets, Extended YaleB, AR, and CMU PIE, and the newly built Virtual MakeUp (VMU) database. For all these datasets, we train a traditional auto-encoder and LCAE separately and then use an SVM as the classifier to obtain the recognition rates. For the Extended YaleB and AR datasets, we follow the settings in the previous section, with 10 training images per class for Extended YaleB. For CMU PIE, we also randomly choose 10 images per class for training and use the remaining images for testing. The VMU dataset is built to simulate the application of makeup by artificially adding makeup to 51 female Caucasian subjects (shown in Fig. 3 (d)). There are four makeup states: (a) no makeup; (b) lipstick only; (c) eye makeup only; and (d) full makeup including lipstick, foundation, blush, and eye makeup. The assembled dataset therefore contains 204 images in total, four per subject. We randomly select half for training and half for testing. The recognition rates on the four datasets are shown in Table 4. We verify the effectiveness of the locality component of our approach by comparing it with the baseline method in [20], which differs only in this aspect. The LCAE algorithm achieves a higher recognition rate by introducing local information into the auto-encoder, which enables it to produce similar features for similar inputs.
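The only difference between the two models compared here is the reconstruction target. The following NumPy sketch shows the sigmoid forward pass and the reconstruction loss for an arbitrary target, so that a plain auto-encoder uses the input itself while LCAE would pass a locality reconstruction instead. The function names, the layer sizes, the random weights, and the $1/(2n)$ scaling of the loss are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(X, W1, b1, W2, b2):
    """Feed-forward pass for all samples at once; columns of X are samples."""
    Z = sigmoid(W1 @ X + b1[:, None])    # input -> hidden units
    H = sigmoid(W2 @ Z + b2[:, None])    # hidden units -> output
    return Z, H

def reconstruction_loss(H, T):
    """Squared error between outputs H and targets T, scaled by 1/(2n)."""
    n = H.shape[1]
    return np.sum((T - H) ** 2) / (2 * n)

rng = np.random.default_rng(0)
dim_in, dim_hidden, n = 4, 3, 5
X = rng.random((dim_in, n))
W1 = 0.1 * rng.standard_normal((dim_hidden, dim_in))
W2 = 0.1 * rng.standard_normal((dim_in, dim_hidden))
b1, b2 = np.zeros(dim_hidden), np.zeros(dim_in)

Z, H = forward(X, W1, b1, W2, b2)
loss_ae = reconstruction_loss(H, X)          # plain auto-encoder: target is the input
# For LCAE the target T would be the locality reconstruction D @ C of the inputs;
# a placeholder with the same shape as X is used here.
T_locality = X.copy()
loss_lcae = reconstruction_loss(H, T_locality)
```

The hidden activations `Z` are the features that are later passed to the classifier; only the target fed to the loss changes between the two models.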

The most important parameter in LCAE is $k$, which determines how many local atoms of the dictionary are used to reconstruct the target in each iteration. As shown in Fig. 6, the effect of $k$ is explored on the AR dataset. The case $k = 0$, in which no locality reconstruction is applied, is considered the baseline, and $k = 5, 25, 45, 65, 85$ are tested. The highest accuracy occurs at $k = 5$, and the accuracy decreases as $k$ increases. When $k = 85$, all dictionary atoms are used and the accuracy falls back to the baseline, which is expected since the local information disappears when all atoms are used.
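The role of $k$ can be made concrete with a sketch of the approximated locality coding, in which only the $k$ nearest atoms take part in the reconstruction. The function name `knn_locality_target`, the small ridge term `reg`, and the toy dictionary are assumptions for illustration; the treatment of $k = 0$ mirrors the baseline described above, where the input itself is the target.

```python
import numpy as np

def knn_locality_target(x, D, k, reg=1e-4):
    """Reconstruct x from its k nearest dictionary atoms with sum-to-one codes.

    Returns the reconstructed target and the full-length code vector.
    """
    code = np.zeros(D.shape[1])
    if k == 0:                                   # baseline: no locality reconstruction
        return x.copy(), code
    dist = np.linalg.norm(D - x[:, None], axis=0)
    nearest = np.argsort(dist)[:k]               # indices of the k closest atoms
    B = D[:, nearest] - x[:, None]               # local atoms shifted by the sample
    C = B.T @ B
    C += reg * max(np.trace(C), 1e-12) * np.eye(k)   # small ridge for stability
    w = np.linalg.solve(C, np.ones(k))
    code[nearest] = w / w.sum()                  # enforce the sum-to-one constraint
    return D @ code, code

# Three nearby atoms and one distant atom in R^2.
D = np.array([[0.0, 1.0, 0.0, 10.0],
              [0.0, 0.0, 1.0, 10.0]])
x = np.array([0.2, 0.3])
target, code = knn_locality_target(x, D, k=3)
```

With a small $k$ the distant atom receives a zero coefficient, so the target depends only on the local neighborhood; choosing $k$ equal to the number of atoms removes this restriction.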

4.3.  Discussion 

From the above experiments on our two proposed algorithms, we find that the locality constraint can improve both the dictionary learning method and the auto-encoder by introducing local information. LC-LRD not only performs well on clean data but also achieves better results on corrupted data. For LCAE, we show its effectiveness on four face datasets and explore its properties by varying the value of $k$.

Figure 6. Accuracy on the AR dataset with varying $k$; $k = 0$ means no locality is applied, and $k = 85$ means all dictionary atoms are used.

Table 4. Average recognition rate (%) of AE and LCAE on different datasets.

Methods      YaleB  AR     PIE    VMU
AE           73.82  72.25  68.05  81.37
LCAE [Ours]  81.43  84.36  72.33  86.27

5.  Conclusion 

This paper investigated the effectiveness of the locality constraint in both dictionary learning and the auto-encoder. We first presented an algorithm that iteratively learns discriminative sub-dictionaries with low-rank regularization and a locality constraint on the coefficients. By applying the locality constraint, we exploited the underlying manifold of the data space and dictionary space more thoroughly than sparse representation does. Second, we proposed the LCAE algorithm, which introduces a locality constraint into the target layer of the auto-encoder. Extensive experiments show that our LC-LRD method outperforms state-of-the-art methods on four benchmarks in both clean and corrupted cases, and that LCAE is able to incorporate local information into the learned features.